knitr::opts_knit$set(root.dir = here::here())

In this file you will find the code we used to generate figures and tables for the paper A robust transcriptomic profile for Chronic Obstructive Pulmonary Disease through a meta-analysis. All code used in this analysis can be found in our Github repository.

Datasets

For the analysis we selected the datasets by following these steps:

knitr::include_graphics(here::here("paper/fig/DataSelection.png"))

1. Search in GEO

First, we serached datasets in Gene Eexpression Omnibus (GEO) from NCBI by using these terms: COPD, emphesema, chronic bronchitis, homo sapiens, GSE, microarray, RNAseq, NOT single cell on 17th March, 2021. For reproducibility purpose we provide the query used (added in supplementary table GEO_search):

Query: ((“chronic obstructive pulmonary disease” OR “COPD” OR “Emphysema” OR “chronic bronchitis”) AND (“Expression profiling by high throughput sequencing”[Filter] OR “Expression profiling by array”[Filter]) AND “Homo sapiens”[orgn] AND “gse”[Filter] NOT “single cell”

2. Download results

We downloaded the results and stored in /GEO_data/2021-03-17gds_result.txt

tail GEO_data/2021-03-17gds_result.txt
## FTP download: GEO (CEL) ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSEnnn/GSE994/
## Series       Accession: GSE994   ID: 200000994
## 
## 168. Chronic obstructive pulmonary disease
## (Submitter supplied) Diaphragm muscles in Chronic Obstructive Pulmonary Disease (COPD) patients undergo an adaptive fast to slow transformation that includes cellular adaptations. This project studies the signaling mechanisms responsible for this transformation. Keywords: other
## Organism:    Homo sapiens
## Type:        Expression profiling by array
## Dataset: GDS289 Platform: GPL96 7 Samples
## FTP download: GEO (CEL) ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSEnnn/GSE475/
## Series       Accession: GSE475   ID: 200000475

3. Parsed information

Then, we parsed the information to have it in a table format by using /GEO_data/parsegeo.pl

cd GEO_data/
perl parsegeo.pl 2021-03-17gds_result.txt
##   168 entries found 
##   168 unique IDs found

The parsed table is in /GEO_data/021-03-17gds_result_tab.txt and added in Supplementary tables `

df <- read.delim(here::here("GEO_data/2021-03-17gds_result_tab.txt"), sep = "\t")
kable(df[1:2,])
No Description Extended.Description Organism Type.Source No..Datasets No..Series No..Rel..Platforms No..Samples Platform Series Dataset GEO.SRA FTP Type Accession ID
1 A transcriptomic atlas of mouse cerebellar cortex reveals novel cell types (Submitter supplied) The cerebellar cortex is a well-studied brain structure with diverse roles in motor learning, coordination, cognition, and autonomic regulation. Nonetheless, a complete inventory of cerebellar cell types is presently lacking. We used high-throughput transcriptional profiling to profile more than 600,000 nuclei and molecularly define cell types across individual lobules of the adult mouse cerebellum. more.. Homo sapiens; Mus musculus Expression profiling by high throughput sequencing NA NA NA NA NA NA NA NA ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE165nnn/GSE165371/ Series GSE165371 200165371
2 Human Ex Vivo Lung Perfusion: A Model System to study Lung diseases and direct targeted therapies (Submitter supplied) In the present study, we demonstrate the feasibility of the EVLP technique to perfuse lungs from deceased patients with a broad spectrum of chronic lung diseases. We were able to assess gene expression changes over the course of EVLP and between end-stage diseases by RNA sequencing transcriptomics from tissue samples collected at different time points, revealing fundamental genetic information which has important avenues for modulation in the future. more.. Homo sapiens Expression profiling by high throughput sequencing NA NA NA 26 GPL18573 NA NA CSV ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE146nnn/GSE146436/ Series GSE146436 200146436

We also added columns to help manual curation (the columns added were SampleType and Disease). The table is provided in Supplementary Tables Parsed_table.

type <- "LUNG TISSUE|BLOOD|MACROPHAGE.|BALF|SPUTUM|CELL|EPITHELIUM|MUSCLE|CULTURE|FIBROBLAST"
df$SampleType <- str_extract(str_to_upper(df$Extended.Description),type)
table(df$SampleType)
## 
##       BLOOD        CELL     CULTURE  EPITHELIUM  FIBROBLAST LUNG TISSUE 
##           9          43           1          28           1          18 
## MACROPHAGE  MACROPHAGES      MUSCLE      SPUTUM 
##           1          11           7           6
disease <- "CHRONIC OBSTRUCTIVE PULMONARY DISEASE|COPD|CANCER|ASTHMA"
df$Disease <-str_extract(str_to_upper(df$Extended.Description),disease) %>% 
  str_replace("CHRONIC OBSTRUCTIVE PULMONARY DISEASE", "COPD")
table(df$Disease)
## 
## ASTHMA CANCER   COPD 
##      8     15     87
#write_delim(df,"GEO_data/2021-03-17_Parsed_table.txt", delim = "\t")

4. Manual curation

Then, we manually curated all datasets to define sample type (e.i. Blood, lung tissue, sputum, cell lines, etc). This table is provided in Supplementary Tables as Manual_curation. Then, we selected “BLOOD” and “LUNG_TISSUE” datasets for a deeper manual curation.

knitr::include_graphics(here::here("paper/fig/PRISMA.png"))

Pre-processing

Data normalized is shown in figure bellow.

library(oligo)
library(affy)
library(SummarizedExperiment)

norm_plots <- function(name,x,plot, mfrow = c(2, 3),...){
  pdf(here::here(name),height = 6,width = 10)
  par(mfrow = mfrow ,mar=c(2,1,1,1))
  for (i in 1:length(x)){
    if(plot == "hist" & class(x[[i]]) =="ExpressionSet"){
        print(paste0(gsub(".GSE.*","",names(x)[i]), " ExpressionSet"))
        hist(x[[i]], main = gsub(".GSE.*","",names(x)[i]))
      }
    if(plot == "boxplot" & class(x[[i]]) =="ExpressionSet"){
        print(paste0(gsub(".GSE.*","",names(x)[i]), " ExpressionSet" ))
        boxplot(x[[i]], main = gsub(".GSE.*","",names(x)[i]))
      }
    if(plot == "hist" & class(x[[i]]) !="ExpressionSet"){
      print(gsub(".GSE.*","",names(x)[i]))
      #hist(assay(lung_norm[[i]]),100)
      plotDensity(log2(assay(lung_norm[[i]])+1), main = gsub(".GSE.*","",names(x)[i]))
    }
    if(plot == "boxplot" & class(x[[i]]) !="ExpressionSet"){
      print(gsub(".GSE.*","",names(x)[i]))
      #hist(assay(lung_norm[[i]]),100)
      boxplot(log2(assay(lung_norm[[i]])+1), main = gsub(".GSE.*","",names(x)[i]))
    }
  }
  dev.off()
}

For blood datasets:

## Read data  
blood_norm <- readRDS(here::here("Blood/output_data/Blood_2020-11-25_Step3_LungTissue-CURATED.RDS"))

#Plot images
norm_plots(name = "paper/fig/blood-norm_hist.pdf", x = blood_norm, plot = "hist")
## [1] "GSE112811 ExpressionSet"
## [1] "GSE56766 ExpressionSet"
## [1] "GSE42057 ExpressionSet"
## [1] "GSE94916 ExpressionSet"
## [1] "GSE76705 ExpressionSet"
## quartz_off_screen 
##                 2
norm_plots(name = "paper/fig/blood-norm_boxplot.pdf", x = blood_norm, plot = "boxplot")
## [1] "GSE112811 ExpressionSet"
## [1] "GSE56766 ExpressionSet"
## [1] "GSE42057 ExpressionSet"
## [1] "GSE94916 ExpressionSet"
## [1] "GSE76705 ExpressionSet"
## quartz_off_screen 
##                 2
knitr::include_graphics(here::here("paper/fig/blood-norm_hist.pdf"))
knitr::include_graphics(here::here("paper/fig/blood-norm_boxplot.pdf"))

For lung datasets:

## Read data  
lung_norm <- readRDS(here::here("Tissue/output_data/2021-03-30_Step3_LungTissue-CURATED.RDS"))
lung_norm <- lapply(lung_norm, function(x){x[[1]]})

#Plot images
norm_plots(name = "paper/fig/lung-norm_hist.pdf", x = lung_norm, plot = "hist", mfrow = c(4, 3))
## [1] "GSE27597 ExpressionSet"
## [1] "GSE106986 ExpressionSet"
## [1] "GSE37768 ExpressionSet"
## [1] "GSE57148"
## [1] "GSE47460 ExpressionSet"
## [1] "GSE38974 ExpressionSet"
## [1] "GSE8581 ExpressionSet"
## [1] "GSE1650 ExpressionSet"
## [1] "GSE1122 ExpressionSet"
## [1] "GSE119040 ExpressionSet"
## [1] "GSE103174 ExpressionSet"
## [1] "GSE124180"
## quartz_off_screen 
##                 2
norm_plots(name = "paper/fig/lung-norm_boxplot.pdf", x = lung_norm, plot = "boxplot", mfrow = c(4, 3))
## [1] "GSE27597 ExpressionSet"
## [1] "GSE106986 ExpressionSet"
## [1] "GSE37768 ExpressionSet"
## [1] "GSE57148"
## [1] "GSE47460 ExpressionSet"
## [1] "GSE38974 ExpressionSet"
## [1] "GSE8581 ExpressionSet"
## [1] "GSE1650 ExpressionSet"
## [1] "GSE1122 ExpressionSet"
## [1] "GSE119040 ExpressionSet"
## [1] "GSE103174 ExpressionSet"
## [1] "GSE124180"
## quartz_off_screen 
##                 2
knitr::include_graphics(here::here("paper/fig/lung-norm_hist.pdf"))
knitr::include_graphics(here::here("paper/fig/lung-norm_boxplot.pdf"))

Summary of data

source(here::here("Lung-Blood/scripts/sample_size.R"))
knitr::include_graphics(here::here("Lung-Blood/fig/sample_size.pdf"))